The Coalescent

Eric C. Anderson

Conservation Genomics Workshop, Monday August 28, 2023

Overview

Goal: Introduce the Coalescent and convey why it is a crucial model for understanding genetic variation

Outline:

  • Deriving the coalescent from the Wright-Fisher model
    • R-notebook interlude: the exponential and geometric distributions
  • Expected properties of coalescent trees
    • Expected interval times
    • Expected total branch lengths
  • Shapes of the coalescent under different demographic scenarios
    • Hands-on: simulating and visualizing coalescent trees
  • Mutations on the coalescent
    • Group activity: “Find your branch”
  • The site frequency spectrum
    • Hands-on: simulating 1-D SFS from different demographic scenarios

The Wright-Fisher model

A simple model for the random process of genes getting from one generation to the next in a population

Assumptions

  • Population of constant size
  • No selection
  • Haploid organisms (diploids are treated as \(2N\) haploids)
  • No sexes
  • Next generation is obtained by sampling with replacement from among the gene copies of the previous generation
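The sampling step in the assumptions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the figures: the function names, the seed, and the starting allele count are assumptions, and the 80 gene copies simply mirror 40 diploids followed for 10 generations.

```python
import random


def wright_fisher_parents(num_copies, rng):
    """Each gene copy in the next generation picks its parent uniformly
    at random, with replacement, from the previous generation."""
    return [rng.randrange(num_copies) for _ in range(num_copies)]


def simulate_allele_count(num_copies, initial_count, generations, seed=1):
    """Track the count of a focal allele through Wright-Fisher generations."""
    rng = random.Random(seed)
    # 1 marks the focal allele, 0 marks any other allele.
    population = [1] * initial_count + [0] * (num_copies - initial_count)
    counts = [initial_count]
    for _ in range(generations):
        parents = wright_fisher_parents(num_copies, rng)
        population = [population[p] for p in parents]
        counts.append(sum(population))
    return counts


# 40 diploids = 80 gene copies, followed for 10 generations.
counts = simulate_allele_count(num_copies=80, initial_count=40, generations=10)
print(counts)
```

Because parents are drawn with replacement, some gene copies leave several descendants and others leave none, which is what makes the allele count drift.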

The Wright-Fisher model. 40 diploids for 10 generations

The Wright-Fisher model. Same thing, but re-ordered

The Wright-Fisher model. Colored Alleles

The Wright-Fisher model. Focus on a sample

A Model With Higher Variance in Reproductive Success

Lineages Coalesce Faster Here

Looks like a Smaller Wright-Fisher population

  • Changes in allele frequencies (or merging of lineages) in that non-WF population happen at the same rate as you would expect in a Wright-Fisher population of smaller size.

  • The effective size of a real population (\(N_e\)) is the size of an ideal (i.e., Wright-Fisher) population that shows the same “genetic behavior” as the real population.

  • So, results we obtain with the W-F model can be applied to real populations by using their effective size.
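One way to make the idea of effective size concrete is to ask how often two gene copies share a parent. The sketch below assumes a hypothetical non-WF model in which each offspring gene copy picks parent \(i\) with probability proportional to a weight \(w_i\) (the weights and function name are illustrative assumptions, not the model behind the slides). Two offspring then share a parent with probability \(\sum_i p_i^2\), where \(p_i\) are the normalized weights, and the Wright-Fisher population with the same one-generation coalescence probability has \(1/\sum_i p_i^2\) gene copies.

```python
def effective_gene_copies(weights):
    """Number of gene copies in a Wright-Fisher population that has the same
    one-generation pairwise coalescence probability as a population whose
    parents are chosen with the given (hypothetical) relative weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    # Probability that two offspring gene copies pick the same parent.
    p_coalesce = sum(p * p for p in probs)
    return 1.0 / p_coalesce


# Equal weights: this is the Wright-Fisher model itself, with 80 gene copies.
equal = effective_gene_copies([1.0] * 80)

# Skewed weights: 8 of the 80 gene copies are ten times as likely to be parents.
skewed = effective_gene_copies([10.0] * 8 + [1.0] * 72)

print(equal, skewed)
```

With equal weights the answer is the census number of gene copies; with skewed weights it is smaller, which is the sense in which the high-variance population "looks like" a smaller Wright-Fisher population.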

Lineages in a Bigger Population for more Generations

The Coalescent Describes The Random Process of Lineages Finding Common Ancestors

  • Focuses only on the properties of sampled genes/sequences
  • Need not consider all the grey individuals (great for simulation)
  • Particularly useful for sequence data
    • So, think of each of these colored balls as a segment of DNA being copied and handed down from one generation to the next.
  • Is pretty easy to derive from the Wright-Fisher model
  • Ultimately, understanding the coalescent helps you to understand how different demographic or evolutionary effects will change the genetic data you expect to see from populations that you study

Let’s derive the coalescent from the W-F model

  • Start small—focus first on a sample of two gene copies:
  • Consider two gene copies and trace their lineages backwards in time:
  • Terminology: sampled gene copies in the present become lineages that travel through gene copies present in previous generations
  • Two lineages coalesce in the generation that their common ancestor lived in.

Simple probabilities

Let there be \(2N\) haploids (\(N\) diploids) in a Wright-Fisher population.

The probability that two lineages coalesce (i.e., were copied from the same gene copy) one generation in the past is: \[ \frac{1}{2N} \]

The probability that they don’t coalesce one generation in the past is simply: \[ 1 - \frac{1}{2N} \]

So, the probability that two lineages first coalesce exactly \(t\) generations in the past is \[ \biggl(1 - \frac{1}{2N}\biggr)^{t - 1}\biggl(\frac{1}{2N}\biggr) \]

This is a geometric distribution with success probability \(\frac{1}{2N}\), so its mean is \(2N\) generations.

The geometric distribution is well-approximated by an exponential distribution with the same mean
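A quick numerical check of this approximation, as a minimal sketch: it compares the probability of no coalescence within \(t\) generations under the geometric distribution with the matching tail of an exponential distribution with mean \(2N\). The value \(2N = 2000\) and the function names are arbitrary choices for illustration.

```python
import math


def geometric_pmf(t, two_n):
    """P(two lineages first coalesce exactly t generations back)."""
    p = 1.0 / two_n
    return (1.0 - p) ** (t - 1) * p


def geometric_tail(t, two_n):
    """P(no coalescence in the first t generations)."""
    return (1.0 - 1.0 / two_n) ** t


def exponential_tail(t, two_n):
    """Same tail probability for an exponential with mean two_n."""
    return math.exp(-t / two_n)


two_n = 2000  # illustrative number of gene copies
for t in (100, 1000, 2000, 5000):
    print(t, geometric_tail(t, two_n), exponential_tail(t, two_n))
```

The two columns agree closely when \(2N\) is large, which is why the continuous-time exponential is used in place of the discrete geometric distribution.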

With \(k>2\) lineages, we wait until the first pair coalesces, and then repeat with one fewer lineage

  • With \(k\) lineages there are \(\frac{k(k-1)}{2}\) pairs of lineages
  • Each pair of lineages is (approximately) independently waiting to coalesce as we go back in time.

\[ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ~~~~~~~~~~~~ \circ \]

The time to the coalescence of the first coalescing pair has an exponential distribution with mean: \[ 2N\frac{2}{k(k-1)} = \frac{4N}{k(k-1)} \]

After that first pair coalesces, the two lineages involved become a single lineage, and the process then waits for the first coalescence among the pairs formed by the remaining \(k-1\) lineages.

And so forth.
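The waiting-time recipe above can be sketched as a tiny simulation. This is an illustrative sketch only: it draws the interval lengths but does not record which lineages merge, and the function name, \(N = 1000\), sample size, seed, and number of replicates are all assumptions.

```python
import random


def simulate_interval_times(n, N, rng):
    """Draw the time (in generations) spent with k = n, n-1, ..., 2 lineages.

    With k lineages the waiting time is exponential with mean 4N / (k(k-1)).
    """
    times = {}
    for k in range(n, 1, -1):
        mean_wait = 4.0 * N / (k * (k - 1))
        # expovariate takes a rate, which is 1 / mean.
        times[k] = rng.expovariate(1.0 / mean_wait)
    return times


rng = random.Random(42)
N, n, reps = 1000, 10, 20000
draws = [simulate_interval_times(n, N, rng) for _ in range(reps)]

mean_first = sum(d[n] for d in draws) / reps
mean_tmrca = sum(sum(d.values()) for d in draws) / reps
print("mean time with 10 lineages:", mean_first)
print("mean time to the MRCA:", mean_tmrca)
```

The average time with 10 lineages should land near \(4N/90\), and the sum of all the intervals is the time until the last two lineages coalesce.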

The last two extant lineages coalesce into the “most recent common ancestor” (MRCA) of the sample

The Anatomy of a Coalescent Tree

  • Let \(T_k\) be the length of time (number of generations) during which there are \(k\) extant lineages in the sample.
  • To the left, \(T_{10}\) is the time until the first pair of sampled gene copies coalesces.
  • \(T_\mathrm{MRCA}\) is the time to the most recent common ancestor.
    • i.e., the time it takes for all the lineages to have coalesced
  • Recall: \[ \mathbb{E}T_k = \frac{4N}{k(k-1)} \]

Vertical Lines are Generations

  • I like to think of the vertical lines (the branches) in the coalescent as strings of beads—each bead is a generation, and there can be a lot of them.

  • Each generation = a meiosis in the lineage

  • What can happen during meiosis? (mutation!)

  • Neutral mutations do not affect the shape of the tree

  • This is the neutral coalescent. (Selection renders things much more difficult.)

Expected Properties of the Coalescent with \(n\) tips

Expected time to the MRCA

\[ \begin{aligned} \mathbb{E}T_\mathrm{MRCA} &= \mathbb{E}\biggl[T_2 + T_3 + \cdots + T_n\biggr] \\ &= \sum_{k=2}^n \frac{4N}{k(k-1)} \\ &= 4N \sum_{k=2}^n \biggl( \frac{1}{k-1} - \frac{1}{k}\biggr) \\ &= 4N\biggl(1 - \frac{1}{n}\biggr) \end{aligned} \]

  • Wow! Even with a huge sample of gene copies (large \(n\)), the expected time to the MRCA is no more than twice that of a sample of 2 gene copies.
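The telescoping sum can be checked with exact arithmetic. A minimal sketch, with illustrative function names and an arbitrary \(N = 1000\): it evaluates the sum of the \(\mathbb{E}T_k\) terms directly, compares it with the closed form, and reports the result relative to the \(n = 2\) case.

```python
from fractions import Fraction


def expected_tmrca_sum(n, N):
    """Sum of E[T_k] = 4N / (k(k-1)) for k = 2..n, as an exact fraction."""
    return sum(Fraction(4 * N, k * (k - 1)) for k in range(2, n + 1))


def expected_tmrca_closed(n, N):
    """Closed form 4N(1 - 1/n), as an exact fraction."""
    return 4 * N * (1 - Fraction(1, n))


N = 1000  # illustrative diploid population size
pair = expected_tmrca_closed(2, N)
for n in (2, 10, 100, 1000):
    assert expected_tmrca_sum(n, N) == expected_tmrca_closed(n, N)
    ratio = expected_tmrca_closed(n, N) / pair
    print(n, float(expected_tmrca_closed(n, N)), float(ratio))
```

The ratio in the last column approaches 2 as \(n\) grows but never reaches it, matching the remark above.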

Expected Properties of the Coalescent with \(n\) tips

Expected total branch length

\[ \begin{aligned} \mathbb{E}T_\mathrm{Tot} &= \mathbb{E}\biggl[2T_2 + 3T_3 + \cdots + nT_n\biggr] \\ &= \sum_{k=2}^n k\frac{4N}{k(k-1)} \\ &= \sum_{k=2}^n \frac{4N}{(k-1)} \\ &= 4N\sum_{k=1}^{n-1} \frac{1}{k} \end{aligned} \]

This is the expected number of opportunities (generations/meioses) for mutations to occur.
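As a rough illustration of how this quantity behaves, the sketch below evaluates the harmonic-sum formula exactly and checks it against the weighted sum of interval times. The function names and \(N = 1000\) are assumptions, and `mu` is a hypothetical per-generation mutation rate for the segment of DNA being followed; multiplying it by the expected total branch length gives the expected number of mutations on the tree.

```python
from fractions import Fraction


def expected_total_branch_length(n, N):
    """Closed form 4N * (1 + 1/2 + ... + 1/(n-1)), as an exact fraction."""
    return 4 * N * sum(Fraction(1, k) for k in range(1, n))


def expected_total_from_intervals(n, N):
    """Direct sum of k * E[T_k] for k = 2..n, as an exact fraction."""
    return sum(k * Fraction(4 * N, k * (k - 1)) for k in range(2, n + 1))


N = 1000   # illustrative diploid population size
mu = 1e-5  # hypothetical mutation rate per generation for the segment
for n in (2, 10, 100, 1000):
    t_tot = expected_total_branch_length(n, N)
    assert t_tot == expected_total_from_intervals(n, N)
    print(n, float(t_tot), mu * float(t_tot))
```

Unlike the expected time to the MRCA, the expected total branch length keeps growing with \(n\), though only slowly, because each additional term in the harmonic sum is smaller than the last.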

The Neutral Coalescent with Varying Demography

We have already noted that non-neutral mutations violate the assumptions of the coalescent.

However, many demographic scenarios can be accommodated in the coalescent framework. E.g.:

  • Population growth / decline
  • Population structure:
    • Migration
    • Population splitting
    • Population joining

Each of these scenarios affects the shape of the coalescent tree.

How Might Population Size Changes Affect the Tree?

  • Imagine that the population was larger in the past. How do you expect that to affect \(T_{10}\)? What about \(T_2\)?

  • Now, what if the population was smaller in the past?

  • We have a short hands-on that lets us simulate, and then visualize, coalescent trees under different demographic scenarios.

Link can be found at
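Separately from the workshop's own hands-on material, here is a minimal sketch of how a single change in population size could be simulated. It assumes a hypothetical piecewise-constant history: size `size_recent` from the present back to `change_time` generations ago, and `size_ancient` before that. All names and parameter values are illustrative. If a drawn waiting time would cross the size change, the clock is moved to the change and the wait is redrawn at the new rate, which is valid because the exponential distribution is memoryless.

```python
import random


def simulate_intervals(n, size_recent, size_ancient, change_time, rng):
    """Draw the time spent with k = n..2 lineages under one size change."""
    t = 0.0
    intervals = {}
    for k in range(n, 1, -1):
        start = t
        while True:
            N = size_recent if t < change_time else size_ancient
            wait = rng.expovariate(k * (k - 1) / (4.0 * N))
            if t < change_time and t + wait > change_time:
                # No coalescence before the size change: jump to it and redraw.
                t = change_time
                continue
            t += wait
            break
        intervals[k] = t - start
    return intervals


def mean_intervals(n, size_recent, size_ancient, change_time, reps=5000, seed=7):
    """Average each interval length over many simulated trees."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in range(2, n + 1)}
    for _ in range(reps):
        sim = simulate_intervals(n, size_recent, size_ancient, change_time, rng)
        for k, length in sim.items():
            totals[k] += length
    return {k: total / reps for k, total in totals.items()}


scenarios = {
    "constant": mean_intervals(10, 1000, 1000, 500),
    "larger in the past": mean_intervals(10, 1000, 10000, 500),
    "smaller in the past": mean_intervals(10, 1000, 100, 500),
}
for name, means in scenarios.items():
    print(f"{name}: mean T_10 = {means[10]:.1f}, mean T_2 = {means[2]:.1f}")
```

Comparing the printed means for \(T_{10}\) and \(T_2\) across the three scenarios is one way to check your intuition about which parts of the tree are stretched or compressed by a size change.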